In this class, we use the phrases statistical learning, machine learning, or simply learning interchangeably.
1.1 Supervised vs unsupervised learning
Supervised learning: input(s) -> output.
Prediction: the output is continuous (income, weight, bmi, …).
Classification: the output is categorical (disease or not, pattern recognition, …).
Unsupervised learning: no output. We learn relationships and structure in the data.
Clustering.
Dimension reduction.
1.2 Supervised learning
Predictors \[
X = \begin{pmatrix} X_1 \\ \vdots \\ X_p \end{pmatrix}.
\] Also called inputs, covariates, regressors, features, independent variables.
Outcome \(Y\) (also called output, response variable, dependent variable, target).
In the regression problem, \(Y\) is quantitative (price, weight, bmi).
In the classification problem, \(Y\) is categorical. That is \(Y\) takes values in a finite, unordered set (survived/died, customer buy product or not, digit 0-9, object in image, cancer class of tissue sample).
We have training data \((\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\). These are observations (also called samples, instances, cases). Training data is often represented by a predictor matrix \[
\mathbf{X} = \begin{pmatrix}
x_{11} & \cdots & x_{1p} \\
\vdots & \ddots & \vdots \\
x_{n1} & \cdots & x_{np}
\end{pmatrix} = \begin{pmatrix} \mathbf{x}_1^T \\ \vdots \\ \mathbf{x}_n^T \end{pmatrix}
\tag{1}\]
and a response vector \[
\mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}
\]
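For concreteness, here is a minimal numpy sketch of how a predictor matrix and response vector are stored; the numbers are made-up toy values, not the Wage data:

```python
import numpy as np

# Toy training data: n = 3 observations, p = 2 predictors (hypothetical values)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])          # n x p predictor matrix
y = np.array([10.0, 20.0, 30.0])    # length-n response vector

n, p = X.shape
print(n, p)      # 3 2
print(X[0])      # first observation x_1: its p predictor values
print(X[:, 0])   # first predictor (column), measured on all n observations
```

Note the two views of the same matrix: a row is one observation, a column is one variable.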
Based on the training data, our goal is to
Accurately predict the unseen outcomes of test cases based on their predictors.
Understand which predictors affect the outcome, and how.
Assess the quality of our predictions and inferences.
1.2.1 Example: salary
The Wage data set contains wage and other data for a group of 3,000 male workers in the Mid-Atlantic region in 2003-2009.
Our goal is to establish the relationship between salary and demographic variables in population survey data.
Since wage is a quantitative variable, this is a regression problem.
library(gtsummary)
library(ISLR2)
library(tidyverse)
# Convert to tibble
Wage <- as_tibble(Wage) %>% print(width = Inf)
# A tibble: 3,000 × 11
year age maritl race education region
<int> <int> <fct> <fct> <fct> <fct>
1 2006 18 1. Never Married 1. White 1. < HS Grad 2. Middle Atlantic
2 2004 24 1. Never Married 1. White 4. College Grad 2. Middle Atlantic
3 2003 45 2. Married 1. White 3. Some College 2. Middle Atlantic
4 2003 43 2. Married 3. Asian 4. College Grad 2. Middle Atlantic
5 2005 50 4. Divorced 1. White 2. HS Grad 2. Middle Atlantic
6 2008 54 2. Married 1. White 4. College Grad 2. Middle Atlantic
7 2009 44 2. Married 4. Other 3. Some College 2. Middle Atlantic
8 2008 30 1. Never Married 3. Asian 3. Some College 2. Middle Atlantic
9 2006 41 1. Never Married 2. Black 3. Some College 2. Middle Atlantic
10 2004 52 2. Married 1. White 2. HS Grad 2. Middle Atlantic
jobclass health health_ins logwage wage
<fct> <fct> <fct> <dbl> <dbl>
1 1. Industrial 1. <=Good 2. No 4.32 75.0
2 2. Information 2. >=Very Good 2. No 4.26 70.5
3 1. Industrial 1. <=Good 1. Yes 4.88 131.
4 2. Information 2. >=Very Good 1. Yes 5.04 155.
5 2. Information 1. <=Good 1. Yes 4.32 75.0
6 2. Information 2. >=Very Good 1. Yes 4.85 127.
7 1. Industrial 2. >=Very Good 1. Yes 5.13 170.
8 2. Information 1. <=Good 1. Yes 4.72 112.
9 2. Information 2. >=Very Good 1. Yes 4.78 119.
10 2. Information 2. >=Very Good 1. Yes 4.86 129.
# ℹ 2,990 more rows
# Load the pandas library
import pandas as pd
# Load numpy for array manipulation
import numpy as np
# Load seaborn plotting library
import seaborn as sns
import matplotlib.pyplot as plt
# Set font size in plots
sns.set(font_scale = 2)
# Display all columns
pd.set_option('display.max_columns', None)
# Import Wage data
Wage = pd.read_csv(
    "./slides/data/Wage.csv",
    dtype = {
        'maritl': 'category',
        'race': 'category',
        'education': 'category',
        'region': 'category',
        'jobclass': 'category',
        'health': 'category',
        'health_ins': 'category'
    }
)
Wage
year age maritl race education \
0 2006 18 1. Never Married 1. White 1. < HS Grad
1 2004 24 1. Never Married 1. White 4. College Grad
2 2003 45 2. Married 1. White 3. Some College
3 2003 43 2. Married 3. Asian 4. College Grad
4 2005 50 4. Divorced 1. White 2. HS Grad
... ... ... ... ... ...
2995 2008 44 2. Married 1. White 3. Some College
2996 2007 30 2. Married 1. White 2. HS Grad
2997 2005 27 2. Married 2. Black 1. < HS Grad
2998 2005 27 1. Never Married 1. White 3. Some College
2999 2009 55 5. Separated 1. White 2. HS Grad
region jobclass health health_ins logwage \
0 2. Middle Atlantic 1. Industrial 1. <=Good 2. No 4.318063
1 2. Middle Atlantic 2. Information 2. >=Very Good 2. No 4.255273
2 2. Middle Atlantic 1. Industrial 1. <=Good 1. Yes 4.875061
3 2. Middle Atlantic 2. Information 2. >=Very Good 1. Yes 5.041393
4 2. Middle Atlantic 2. Information 1. <=Good 1. Yes 4.318063
... ... ... ... ... ...
2995 2. Middle Atlantic 1. Industrial 2. >=Very Good 1. Yes 5.041393
2996 2. Middle Atlantic 1. Industrial 2. >=Very Good 2. No 4.602060
2997 2. Middle Atlantic 1. Industrial 1. <=Good 2. No 4.193125
2998 2. Middle Atlantic 1. Industrial 2. >=Very Good 1. Yes 4.477121
2999 2. Middle Atlantic 1. Industrial 1. <=Good 1. Yes 4.505150
wage
0 75.043154
1 70.476020
2 130.982177
3 154.685293
4 75.043154
... ...
2995 154.685293
2996 99.689464
2997 66.229408
2998 87.981033
2999 90.481913
[3000 rows x 11 columns]
# Plot wage ~ age
sns.lmplot(
    data = Wage, x = "age", y = "wage",
    lowess = True, scatter_kws = {'alpha': 0.1}, height = 8
).set(xlabel = 'Age', ylabel = 'Wage (k$)')
Figure 1: Wage changes nonlinearly with age.
# Plot wage ~ year
sns.lmplot(
    data = Wage, x = "year", y = "wage",
    scatter_kws = {'alpha': 0.1}, height = 8
).set(xlabel = 'Year', ylabel = 'Wage (k$)')
Figure 3: Average wage increases by $10k in 2003-2009.
# Plot wage ~ education
ax = sns.boxplot(data = Wage, x = "education", y = "wage")
ax.set(xlabel = 'Education', ylabel = 'Wage (k$)')
ax.set_xticklabels(ax.get_xticklabels(), rotation = 15)
Figure 5: Wage increases with education level.
# Plot wage ~ race
ax = sns.boxplot(data = Wage, x = "race", y = "wage")
ax.set(xlabel = 'Race', ylabel = 'Wage (k$)')
ax.set_xticklabels(ax.get_xticklabels(), rotation = 15)
Figure 6: Any income inequality?
1.2.2 Example: stock market
library(quantmod)
SP500 <- getSymbols(
  "^GSPC",
  src = "yahoo", auto.assign = FALSE,
  from = "2022-01-01", to = "2022-12-31"
)
chartSeries(
  SP500,
  theme = chartTheme("white"),
  type = "line", log.scale = FALSE, TA = NULL
)
The Smarket data set contains daily percentage returns for the S&P 500 stock index between 2001 and 2005.
Our goal is to predict whether the index will increase or decrease on a given day, using the past 5 days’ percentage changes in the index.
Since the outcome is binary (increase or decrease), it is a classification problem.
From the boxplots in Figure 7, it seems that the previous 5 days' percentage returns do not discriminate whether today's return is positive or negative.
# Data information
help(Smarket)
# Convert to tibble
Smarket <- as_tibble(Smarket) %>% print(width = Inf)
# A tibble: 1,250 × 9
Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
1 2001 0.381 -0.192 -2.62 -1.06 5.01 1.19 0.959 Up
2 2001 0.959 0.381 -0.192 -2.62 -1.06 1.30 1.03 Up
3 2001 1.03 0.959 0.381 -0.192 -2.62 1.41 -0.623 Down
4 2001 -0.623 1.03 0.959 0.381 -0.192 1.28 0.614 Up
5 2001 0.614 -0.623 1.03 0.959 0.381 1.21 0.213 Up
6 2001 0.213 0.614 -0.623 1.03 0.959 1.35 1.39 Up
7 2001 1.39 0.213 0.614 -0.623 1.03 1.44 -0.403 Down
8 2001 -0.403 1.39 0.213 0.614 -0.623 1.41 0.027 Up
9 2001 0.027 -0.403 1.39 0.213 0.614 1.16 1.30 Up
10 2001 1.30 0.027 -0.403 1.39 0.213 1.23 0.287 Up
# ℹ 1,240 more rows
# Summary statistics
summary(Smarket)
Year Lag1 Lag2 Lag3
Min. :2001 Min. :-4.922000 Min. :-4.922000 Min. :-4.922000
1st Qu.:2002 1st Qu.:-0.639500 1st Qu.:-0.639500 1st Qu.:-0.640000
Median :2003 Median : 0.039000 Median : 0.039000 Median : 0.038500
Mean :2003 Mean : 0.003834 Mean : 0.003919 Mean : 0.001716
3rd Qu.:2004 3rd Qu.: 0.596750 3rd Qu.: 0.596750 3rd Qu.: 0.596750
Max. :2005 Max. : 5.733000 Max. : 5.733000 Max. : 5.733000
Lag4 Lag5 Volume Today
Min. :-4.922000 Min. :-4.92200 Min. :0.3561 Min. :-4.922000
1st Qu.:-0.640000 1st Qu.:-0.64000 1st Qu.:1.2574 1st Qu.:-0.639500
Median : 0.038500 Median : 0.03850 Median :1.4229 Median : 0.038500
Mean : 0.001636 Mean : 0.00561 Mean :1.4783 Mean : 0.003138
3rd Qu.: 0.596750 3rd Qu.: 0.59700 3rd Qu.:1.6417 3rd Qu.: 0.596750
Max. : 5.733000 Max. : 5.73300 Max. :3.1525 Max. : 5.733000
Direction
Down:602
Up :648
# Plot Direction ~ Lag1, Direction ~ Lag2, ...
Smarket %>%
  pivot_longer(cols = Lag1:Lag5, names_to = "Lag", values_to = "Perc") %>%
  ggplot() +
  geom_boxplot(mapping = aes(x = Direction, y = Perc)) +
  labs(
    x = "Today's Direction",
    y = "Percentage change in S&P",
    title = "Up and down of S&P doesn't depend on previous day(s)'s percentage of change."
  ) +
  facet_wrap(~ Lag)
Figure 7: LagX is the percentage return for the previous X days.
# Pivot to long format for facet plotting
Smarket_long = pd.melt(
    Smarket,
    id_vars = ['Year', 'Volume', 'Today', 'Direction'],
    value_vars = ['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5'],
    var_name = 'Lag', value_name = 'Perc'
)
Smarket_long
Year Volume Today Direction Lag Perc
0 2001 1.19130 0.959 Up Lag1 0.381
1 2001 1.29650 1.032 Up Lag1 0.959
2 2001 1.41120 -0.623 Down Lag1 1.032
3 2001 1.27600 0.614 Up Lag1 -0.623
4 2001 1.20570 0.213 Up Lag1 0.614
... ... ... ... ... ... ...
6245 2005 1.88850 0.043 Up Lag5 -0.285
6246 2005 1.28581 -0.955 Down Lag5 -0.584
6247 2005 1.54047 0.130 Up Lag5 -0.024
6248 2005 1.42236 -0.298 Down Lag5 0.252
6249 2005 1.38254 -0.489 Down Lag5 0.422
[6250 rows x 6 columns]
g = sns.FacetGrid(Smarket_long, col = "Lag", col_wrap = 3, height = 10)
g.map_dataframe(sns.boxplot, x = "Direction", y = "Perc")
plt.clf()
1.2.3 Real Example (1)
Development and validation of a bronchoalveolar lavage genomic classifier for acute cellular rejection. EBioMedicine. 2025 Dec;122:106046. doi: 10.1016/j.ebiom.2025.106046.
Lung transplant recipients are at risk for acute cellular rejection (ACR), which is a major cause of morbidity and mortality.
Genomic classifier
RNA Seq data (Transcriptome) is a high-dimensional data set.
1.2.7 Example: classify the pixels in a satellite image, by usage
Figure 9: LANDSAT images (ESL Figure 13.6).
LANDSAT: 82x100 pixels. Four heat-map images, two in the visible spectrum and two in the infrared, for an area of agricultural land in Australia.
Each pixel has a class label from the 7-element set {red soil, cotton, vegetation stubble, mixture, gray soil, damp gray soil, very damp gray soil}, determined manually by research assistants surveying the area. The objective is to classify the land usage at a pixel, based on the information in the four spectral bands.
1.3 Unsupervised learning
No outcome variable, just predictors.
The objective is fuzzier: find groups of observations that behave similarly, find features that behave similarly, find linear combinations of features with the most variation, fit generative models (e.g., transformers).
Difficult to know how well you are doing.
Can be useful in exploratory data analysis (EDA) or as a pre-processing step for supervised learning.
1.3.1 Example: gene expression
The NCI60 data set consists of 6,830 gene expression measurements for each of 64 cancer cell lines.
# Apply PCA using the prcomp function
# Need to scale / center since PCA depends on the distance measure
prcomp(NCI60$data, scale = TRUE, center = TRUE, retx = TRUE)$x %>%
  as_tibble() %>%
  add_column(cancer_type = NCI60$labs) %>%
  # Plot PC2 vs PC1
  ggplot() +
  geom_point(mapping = aes(x = PC1, y = PC2, color = cancer_type)) +
  labs(title = "Gene expression profiles cluster according to cancer types")
# Plot PC2 vs PC1
sns.relplot(
    kind = 'scatter', data = nci60_pc,
    x = 'PC1', y = 'PC2', hue = 'cancer_type', height = 10
)
1.3.2 Example: mapping people from their genomes
The genetic makeup of \(n\) individuals can be represented by a matrix Equation 1, where \(x_{ij} \in \{0, 1, 2\}\) is the \(j\)-th genetic marker of the \(i\)-th individual.
Is it possible to visualize the geographic relationship of these individuals?
1805, least squares / linear regression / shallow learning by Gauss.
1936, classification by linear discriminant analysis by Fisher.
1940s, logistic regression.
Early 1970s, generalized linear models (GLMs).
Mid 1980s, classification and regression trees.
1980s, generalized additive models (GAMs).
1980s, neural networks gained popularity.
1990s, support vector machines.
2010s, deep learning.
2 Course logistics
2.1 Learning objectives
Understand what machine learning is (and isn’t).
Learn some foundational methods/tools.
For specific data problems, be able to choose methods that make sense.
Tip
Q: Wait, Dr. Zhou! Why don’t we just learn the best method (aka deep learning) first?
A: No single method dominates. One method may prove useful in answering some questions on a given data set. On a related (not identical) data set or question, another might prevail. Article, Article
2.2 Syllabus
Read syllabus and schedule for a tentative list of topics and course logistics.
Homework assignments will be a mix of theoretical/conceptual and applied/computational questions. Although not required, you are highly encouraged to practice literate programming (using Jupyter, Quarto, RMarkdown, or Google Colab) coordinated through Git/GitHub. This will enhance your GitHub profile and make you more appealing on the job market.
However, I do not require homework submission through Git/GitHub; homework is submitted through BruinLearn.
We will mainly use R in this course.
2.3 What I expect from you
You are curious and are excited about “figuring stuff out”.
You are proficient in coding and debugging (or are ready to work to get there).
You have a solid foundation in introductory statistics (or are ready to work to get there).
You are willing to ask questions.
2.4 What you can expect from me
I value your learning experience and process.
I’m flexible with respect to the topics we cover.
I’m happy to share my professional connections.
I’ll try my best to be responsive in class, in office hours, and other professional encounters.
3 Notation and Simple Matrix Algebra
3.1 Notation
We will use \(n\) to represent the number of distinct data points, or observations, in our sample.
We will let \(p\) denote the number of variables that are available for use in making predictions.
For example, the Wage data set consists of 11 variables for 3,000 people, so we have \(n = 3,000\) observations and \(p = 11\) variables (such as year, age, race, and more).
\(p\) can be quite large, such as on the order of thousands or even millions, e.g., modern biological data, like gene expression, DNA sequences along the genome.
We will let \(x_{ij}\) represent the value of the \(j\)th variable for the \(i\)th observation, where \(i = 1,2,\ldots,n\) and \(j = 1,2,\ldots,p\).
We will let \(i\) be the index of the samples or observations (from 1 to \(n\)) and \(j\) will be used to index the variables (from 1 to \(p\)).
We let \(\mathbf{X}\) denote an \(n \times p\) matrix whose \((i, j)\)th element is \(x_{ij}\): \[
\mathbf{X} = \begin{pmatrix}
x_{11} & \cdots & x_{1p} \\
\vdots & \ddots & \vdots \\
x_{n1} & \cdots & x_{np}
\end{pmatrix} = \begin{pmatrix} \mathbf{x}_1^T \\ \vdots \\ \mathbf{x}_n^T \end{pmatrix}
\]
The rows of \(\mathbf{X}\) we write as \(x_1, x_2, \ldots , x_n\). Here \(x_i\) is a vector of length \(p\), containing the \(p\) variable measurements for the \(i\)th observation. That is, \[
x_i = \begin{pmatrix} x_{i1} \\ \vdots \\ x_{ip} \end{pmatrix}.
\] Note: vectors are by default represented as columns.
At other times we will instead be interested in the columns of \(\mathbf{X}\), which we write as \(\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_p\). Each is a vector of length \(n\). That is, \[
\mathbf{x}_j = \begin{pmatrix} x_{1j} \\ \vdots \\ x_{nj} \end{pmatrix}.
\]
Using this notation, the matrix \(\mathbf{X}\) can be written as \[
\mathbf{X} = (\mathbf{x}_1 \quad \mathbf{x}_2 \quad \ldots \quad \mathbf{x}_p),
\] or \[
\mathbf{X} = \begin{pmatrix} x_{1}^T \\ \vdots \\ x_{n}^T \end{pmatrix}.
\]
We use \(y_i\) to denote the \(i\)th observation of the variable on which we wish to make predictions (i.e., “outcome”), such as wage. Hence, we write the set of all \(n\) observations in vector form as \[
\mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}
\] Then our observed data consists of \({(x_1, y_1), (x_2, y_2), \ldots , (x_n, y_n)}\), where each \(x_i\) is a vector of length \(p\). (If \(p = 1\), then \(x_i\) is simply a scalar.)
3.2 Matrix Algebra
Matrices will be denoted using bold capitals, such as \(\mathbf{A}\).
To indicate that an object is a scalar, we will use the notation \(a \in R\).
To indicate that it is a vector of length \(k\), we will use \(\mathbf{a}\in R^k\) (or \(\mathbf{a}\in R^n\) if it is of length \(n\)).
We will indicate that an object is an \(r \times s\) matrix using \(\mathbf{A} \in R^{r\times s}\).
3.2.1 Special cases of matrices
A column vector is a matrix with only one column, e.g. \[
\mathbf{A} = \left(\begin{array}{c}
1 \\
4 \\
0\\
-2\\
\end{array}\right)
\]
A row vector is a matrix with only one row, e.g. \[
\mathbf{A} = \left(\begin{array}{cccc}
1 & 4 & 0 & -2\\
\end{array}\right)
\]
A matrix with \(r = s\), that is, with the same number of rows and columns, is called a square matrix. If a matrix is square, the elements \(a_{ii}\) are said to lie on the diagonal of \(\mathbf{A}\). \[
\mathbf{A} = \left(\begin{array}{cc}
1 & 4 \\
0 & -2
\end{array}\right)
\]
A square matrix is called symmetric if \(a_{ij} = a_{ji}\) for all values of \(i\) and \(j\). \[
\mathbf{A} = \left(\begin{array}{ccc}
3 &5& 7 \\
5 &1& 4 \\
7 &4 &8
\end{array}\right)
\] Symmetric matrices turn out to be quite important in formulating statistical models for all types of data!
An important special case of a square, symmetric matrix is the identity matrix, i.e., a square matrix with \(1\)s on diagonal, \(0\)s elsewhere, e.g. \[
\mathbf{A} = \left(\begin{array}{ccc}
1 & 0 & 0 \\
0 & 1 & 0 \\
0& 0& 1\\
\end{array}\right)
\] The identity matrix functions the same way as “\(1\)” does in the real number system.
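This "acts like 1" property is easy to check numerically; a minimal numpy sketch, reusing the symmetric matrix from the previous example:

```python
import numpy as np

I = np.eye(3)                  # 3 x 3 identity matrix
A = np.array([[3., 5., 7.],
              [5., 1., 4.],
              [7., 4., 8.]])   # the symmetric matrix from the text

# Multiplying by the identity leaves A unchanged, just as 1 * a = a for scalars
assert np.allclose(I @ A, A)
assert np.allclose(A @ I, A)
```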
3.2.2 Matrix operations
3.2.2.1 Transpose
The \(^T\) notation denotes the transpose of a matrix or vector
\[
\mathbf{X}^T = \begin{pmatrix}
x_{11} & \cdots & x_{1n} \\
\vdots & \ddots & \vdots \\
x_{p1} & \cdots & x_{pn}
\end{pmatrix} = \begin{pmatrix} \mathbf{x}_1 & \cdots & \mathbf{x}_n \end{pmatrix}
\] So the transpose of an \(n\times p\) matrix is a \(p\times n\) matrix. That is, the transpose of \(\mathbf{A}\) is the matrix found by ``flipping'' the matrix around.
For example, \[
\mathbf{A} =
\left(
\begin{array}{ccc}
1 & 2 & 3 \\
4 & 5 & 6 \\
\end{array}
\right)
\quad
\mathbf{A}^T =
\left(
\begin{array}{cc}
1 & 4 \\
2 & 5 \\
3 & 6 \\
\end{array}
\right)
\] A fundamental property of a symmetric matrix is that the matrix and its transpose are the same; i.e., if \(\mathbf{A}\) is symmetric then \(\mathbf{A} = \mathbf{A}^T\). (Try it on the symmetric matrix above.)
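These transpose facts can be verified numerically; a short numpy sketch using the matrices above:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])      # the 2 x 3 matrix from the text

# Transposing flips the dimensions: (2, 3) -> (3, 2)
assert A.T.shape == (3, 2)
assert A.T.tolist() == [[1, 4], [2, 5], [3, 6]]

# A symmetric matrix equals its own transpose
S = np.array([[3, 5, 7],
              [5, 1, 4],
              [7, 4, 8]])
assert (S == S.T).all()
```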
3.2.2.2 Matrix Addition and Subtraction
Adding or subtracting two matrices is defined element by element. That is, to add two matrices, add their corresponding elements, e.g. \[
\mathbf{A} =
\left(
\begin{array}{cc}
1 & 2 \\
4 & 5 \\
\end{array}
\right)
\quad
\mathbf{B} =
\left(
\begin{array}{cc}
6 & 4 \\
2 & -1 \\
\end{array}
\right)
\] Then, \[
\mathbf{A} + \mathbf{B} =
\left(
\begin{array}{cc}
7 & 6 \\
6 & 4 \\
\end{array}
\right)
\quad
\mathbf{A} - \mathbf{B} =
\left(
\begin{array}{cc}
-5 & -2 \\
2 & 6 \\
\end{array}
\right)
\] Note that these operations only make sense if the two matrices have the same dimension; the operations are not defined otherwise.
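A numpy check of these element-wise sums and differences:

```python
import numpy as np

A = np.array([[1, 2],
              [4, 5]])
B = np.array([[6, 4],
              [2, -1]])

# Addition and subtraction work element by element
# (and require equal dimensions)
assert (A + B).tolist() == [[7, 6], [6, 4]]
assert (A - B).tolist() == [[-5, -2], [2, 6]]
```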
3.2.2.3 Matrix Multiplication
The effect of multiplying a matrix \(\mathbf{A}\) with any dimension by a real number (scalar) \(b\), say, is to multiply each element in \(\mathbf{A}\) by \(b\). \[
3\left(
\begin{array}{cc}
1 & 2 \\
4 & 5 \\
\end{array}
\right) =
\left(
\begin{array}{cc}
3 & 6 \\
12 & 15 \\
\end{array}
\right)
\]
For the product of two matrices to be defined, the number of columns of the first matrix must equal the number of rows of the second, e.g., \[
\mathbf{A} = \left(
\begin{array}{ccc}
1 & 2 &5 \\
4 & 5 &1 \\
\end{array}
\right) \quad
\mathbf{B} = \left(
\begin{array}{cc}
3 & 6 \\
2 & 5 \\
1 & 2 \\
\end{array}
\right)\\ \quad
\mathbf{C} = (c_{ij}) = \mathbf{AB} = \left(
\begin{array}{cc}
12 & 26 \\
23 & 51 \\
\end{array}
\right)
\]
Formally, if \(\mathbf{A}\) is \((r\times s)\) and \(\mathbf{B}\) is \((s\times q)\), then \(\mathbf{AB}\) is a \((r\times q)\) matrix with \((i,j)\)th element \[
\sum_{k=1}^s a_{ik}b_{kj}.
\]
For any matrix \(\mathbf{A}\), \(\mathbf{A}^T\mathbf{A}\) will be a square matrix.
The transpose of a matrix product: \((\mathbf{AB})^T=\mathbf{B}^T\mathbf{A}^T\).
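The products above, along with the two closing facts, can be verified with numpy, using the matrices from the text:

```python
import numpy as np

A = np.array([[1, 2, 5],
              [4, 5, 1]])      # 2 x 3
B = np.array([[3, 6],
              [2, 5],
              [1, 2]])         # 3 x 2

# Scalar multiplication scales every element
assert (3 * A).tolist() == [[3, 6, 15], [12, 15, 3]]

# Matrix product: (2 x 3)(3 x 2) -> 2 x 2
C = A @ B
assert C.tolist() == [[12, 26], [23, 51]]

# (AB)^T = B^T A^T, and A^T A is always square (here 3 x 3)
assert ((A @ B).T == B.T @ A.T).all()
assert (A.T @ A).shape == (3, 3)
```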
3.2.3 Example
Consider a prediction model, e.g., the wage data example: suppose that we have \(n\) pairs \((x_1,Y_1),\ldots,(x_n,Y_n)\), and we believe that, except for a random deviation, the relationship between the predictor \(x\) (e.g., age) and the response \(Y\) follows a straight line. That is, for \(j=1,\ldots,n\), we have \[
Y_j = \beta_0 + \beta_1 x_j + \epsilon_j,
\] where \(\epsilon_j\) is a random deviation representing the amount by which the actual observed response \(Y_j\) deviates from the exact straight-line relationship. Defining \[
\mathbf{X}= \left(
\begin{array}{cc}
1 & x_1 \\
1 & x_2 \\
\vdots & \vdots\\
1&x_n\\
\end{array}
\right),\quad
\mathbf{Y}= \left(
\begin{array}{c}
Y_1 \\
Y_2 \\
\vdots \\
Y_n\\
\end{array}
\right),\quad
\epsilon= \left(
\begin{array}{c}
\epsilon_1 \\
\epsilon_2 \\
\vdots \\
\epsilon_n\\
\end{array}
\right),
\beta= \left(
\begin{array}{c}
\beta_0 \\
\beta_1 \\
\end{array}
\right),
\] we may express the model succinctly as \[
\mathbf{Y}=\mathbf{X}\beta +\epsilon.
\]
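A minimal numpy sketch of this matrix form, using hypothetical ages and assumed coefficients (none of these numbers come from the Wage data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: n = 5 ages, and an assumed (beta_0, beta_1)
x = np.array([18., 24., 45., 43., 50.])
beta = np.array([20.0, 1.5])

# Design matrix X: a column of 1s (intercept) next to the predictor
X = np.column_stack([np.ones_like(x), x])   # n x 2
eps = rng.normal(scale=5.0, size=x.size)    # random deviations epsilon

# The n scalar equations Y_j = beta_0 + beta_1 x_j + eps_j
# collapse into one matrix product: Y = X beta + eps
Y = X @ beta + eps
assert Y.shape == (5,)

# Least squares recovers an estimate of beta from (X, Y)
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat)
```

The point of the matrix notation is exactly this collapse: \(n\) scalar equations become the single equation \(\mathbf{Y}=\mathbf{X}\beta+\epsilon\).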